All scripts scraped from an online repository: http://www.chakoteya.net/DoctorWho/index.html
Ratings and runtimes retrieved from IMDB.
Information to match serialized parts with episodes, as well as information regarding writers and UK viewership numbers, taken from: https://en.wikipedia.org/wiki/List_of_Doctor_Who_episodes_(1963-1989)
Python code used for web crawling and initial data construction can be found at: https://github.com/LaurenceDyer/DocWho
Spanning roughly 60 years, Doctor Who is a collection of episodic, science fiction radio plays and television serials starring the eponymous “Doctor”, a humanoid alien from the planet Gallifrey. The Doctor explores the Universe, though mainly staying on or around Earth, with his long-term companions. Upon his death, the Doctor regenerates, and a new actor takes the role. As such, the Doctor has been portrayed by no fewer than 13 actors over the show’s history.
A 15-year intermission in production occurred between 1990 and 2005, leading many fans to consider the series split between a classic run (Doctors 1 through 7) and a modern run (Doctors 9 through 13). Doctor 8 was portrayed only in a made-for-TV movie and is therefore missing from this analysis.
The data set available is, accordingly, very large, with roughly 250,000 lines of dialogue over roughly 330 episodes.
In this report we will perform a statistical analysis on the scripts of Doctor Who episodes aired between 1963 and 2022.
After viewing the initial input data, we see immediately that we must remedy the issue of classic episodes being recorded on IMDB according to their individual parts/viewing slots, rather than as whole episodes. We’ll combine these parts into single episodes, matching our script source and maintaining complete episode narratives for some of our downstream analysis, e.g. episode sentiment. We perform this using data scraped from our third source, Wikipedia.
#The most important information we need here is the number of parts of each episode
noParts <- ddply(classic_wiki, .(Title), .fun = nrow)
#...but we will also record data such as the mean viewers, calculated from the individual viewership of each "part".
viewers <- aggregate(`Viewers(millions)`~Title,classic_wiki, FUN = function(x) mean(as.numeric(x)))
#Flatten df and merge title, number of parts and viewers
classic_wiki <- classic_wiki[!duplicated(classic_wiki$Title),c(1,2,3,4,5,6)]
classic_wiki$Index <- c(1:length(classic_wiki$Title))
classic_wiki2 <- merge(merge(classic_wiki,viewers,by="Title"),noParts,by="Title")
classic_wiki2 <- classic_wiki2[order(classic_wiki2$Index),]
classic_wiki2$Episode <- as.numeric(classic_wiki2$Episode)
#Fix missing episodes - Some episodes are edge cases with a peculiar episode number that differs between our sources; we must correct them
classic_wiki2[131,]$Episode <- 7
classic_wiki2 <- classic_wiki2[-130,c(1,2,3,5,6,7,8,9)]
#Correct season number - This information is easier to calculate than to import
seasonNo = 0
for(i in c(1:length(classic_wiki2$Title))){
  if(classic_wiki2[i,]$Episode==1){
    seasonNo = seasonNo+1
  }
  classic_wiki2[i,]$Season <- seasonNo
}
We need to assign a part vector to each episode in our ratings df.
#We generate a vector we will use to merge the rows of our IMDB ratings dataframe
partsVec <- c()
j <- 1
for(i in c(1:length(classic_wiki2$V1))){
  partsVec <- c(partsVec,rep(j,classic_wiki2[i,]$V1))
  j=j+1
}
classic_ratings$Parts <- partsVec
#Aggregate our data over each part, leaving us a single row for a full episode
classic_ratings$Runtime <- as.numeric(gsub(" minutes","",classic_ratings$Runtime))
classic_ratings[classic_ratings$Season==20 & classic_ratings$Episode==23,]$Runtime <- 90
classic_ratings_rat <- aggregate(Rating~Parts,classic_ratings,mean)
classic_ratings_run <- aggregate(Runtime~Parts,classic_ratings,sum)
And after a little more processing we can merge them together. Great! Done. We’ll perform a similar (but much simpler) process with our modern episodes until we create a dataframe with all the relevant information for each episode:
classic_ext <- merge(classic_wiki2,classic_ratings2,by="Series_Episode")
classic_ext <- classic_ext[,c(1:6,8,13,14)]
colnames(classic_ext)[3:4] <- c("Season","Episode")
The scripts that we are using are designed to be human readable, but do not necessarily lend themselves well to mass data analysis. They were also transcribed manually, and as such, many typos are present. The process of correcting script-breaking typos (such as typo’d dialogue syntax) has already been performed; however, it is likely that among the roughly 240,000 dialogue lines an unknown number of artefacts remain.
The first major pass of artefact removal in our script lines has already been performed, largely in Python. This process also involved removing several strings which defined stage direction cues or provided visual descriptions of events.
Let’s start with some very general data overviews to see if we can locate any major remaining artefacts.
Let’s take a quick look at the evolution of the data we’ve crawled from Wikipedia and IMDB. We can explore how runtime, rating and viewership have changed over time.
When it comes to episode runtime, we see two clear trends: the serialised classic-era episodes have far greater variation in length than the more restrained modern era, and modern-era episodes are growing longer as the newer seasons stretch on.
When it comes to rating, we see that “Orphan 55”, an episode about climate action, is undoubtedly the least popular Doctor Who episode going. In fact, all 5 of the lowest-rated episodes are from the latest 3 seasons of the show, starring the 13th Doctor. Viewership numbers have dropped accordingly, dipping below the previous all-time low held by “Battlefield” and other episodes from the final classic season.
“Utopia”, the first episode of the multi-episode season 29 finale, holds the title of highest-rated episode; it features the return of the long-time series villain “The Master”. “City of Death”, meanwhile, aired at prime time in the middle of a workers’ strike that took ITV, the BBC’s main source of competition, off air for several weeks - boosting its viewership enormously.
This all seems to agree with expectations and is not indicative of a large underlying error.
Let’s see which writers were the most popular over the series’ run:
Steven Moffat and Russell T. Davies prove to be among the most popular writers by episode rating, while also being the longest-serving writers by episode count, at 48 and 31 episode credits respectively. Again, we see nothing to indicate wide-ranging errors.
To get a sense of how deep some of our script-input problems might run, we’ll need to examine the data and try to get an overview of the potential data cleaning we have to perform. Let’s start by examining our two most rigid columns, “Character” and “Location”.
| Character | Frequency | Location | Frequency |
|---|---|---|---|
| DOCTOR | 58665 | [Tardis] | 13713 |
| CLARA | 3635 | [Control room] | 4507 |
| JAMIE | 3374 | [Corridor] | 3532 |
| IAN | 2956 | [Laboratory] | 2509 |
| BRIGADIER | 2809 | [Spaceship] | 2473 |
| SARAH | 2808 | [Tunnel] | 2003 |
Looks reasonable. What’s more iconic than the Doctor and his TARDIS?
However, we can assume that many errors lie lower down this frequency list. Let’s see how many characters and locations appear only once - they are quite likely to have been recorded in error.
Wow! That doesn’t look too bad at all. Of course, we have many locations that do appear only briefly in the show, such as “Great Wall of China 1904”. In our data source, locations are always bounded by “[” and ”]” so they are easy to pick out and rarely made in error.
Some background characters, particularly aliens, are often listed as “DALEK1” or “ZYGON2”. We would rather tabulate these characters together going forward, aggregating all of them into their alien race or profession, unless the character is specifically named.
For the Doctor, who may appear numbered if, e.g., DOCTOR10 turns up in a flashback during a DOCTOR12 episode, and for robots like “K9”, we’ll skip this step.
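A minimal sketch of this renaming step, assuming character names arrive as plain strings (the keep-list of name prefixes is our own choice, not part of the original data):

```python
import re

# Hypothetical helper: collapse numbered background characters ("DALEK2" -> "DALEK")
# while leaving numbered Doctors and robots like "K9" untouched.
KEEP_NUMBERED = ("DOCTOR", "K9")

def collapse_numbered(name: str) -> str:
    # Names on the keep-list keep their trailing digits
    if name.startswith(KEEP_NUMBERED):
        return name
    # Otherwise strip any trailing run of digits
    return re.sub(r"\d+$", "", name)
```

Applied to a character column, this maps “SARAH2” to “SARAH” and “MONOID1” to “MONOID”, but leaves “DOCTOR10” alone.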
Before:
##
## DALEK2 DALEK1 SARAH2 DALEK3 MONOID1
## 380 241 143 139 94
And after:
##
## DALEK MONOID SARAH DRAHVIN CYBERMAN
## 800 244 144 99 91
One thing we can note about the scripts is that characters occasionally speak at once, i.e., “DOCTOR-AND-ROSE:”. We can either ignore these lines, or duplicate each script line and location, creating a row for each character. Let’s give the second option a try, and while we’re here, let’s run some quick analysis on those double lines, as there are so few of them - who speaks together most often?
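The duplication step can be sketched as follows (a simplified version working on (character, script) pairs; the real data of course carries more columns):

```python
def split_joint_lines(rows):
    # "DOCTOR-AND-ROSE" becomes one duplicated row per speaker,
    # so each character gets credit for the shared line.
    out = []
    for character, script in rows:
        for speaker in character.split("-AND-"):
            out.append((speaker, script))
    return out
```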
In S30E10, the character SKY, once possessed, immediately repeats the words of those around her. It makes sense that she and the Doctor dominate these overlaps.
We also ought to correct for the fact that “DALEK” and “DALEKS” are roughly interchangeable for our purposes, so let’s see if we can’t combine these characters, as well as all the other alien species and professions that appear as both singular and plural.
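Since English pluralisation is irregular (“CYBERMEN”, “DALEKS”), a small curated map is safer than a suffix rule. A sketch, with a hypothetical (incomplete) map of our own devising - any name not listed is left untouched, so specifically named characters are safe:

```python
# Hypothetical, manually curated plural -> singular map (not exhaustive)
PLURALS = {
    "DALEKS": "DALEK",
    "CYBERMEN": "CYBERMAN",
    "ZYGONS": "ZYGON",
    "SONTARANS": "SONTARAN",
}

def merge_plurals(name: str) -> str:
    # Fall back to the original name if it isn't in the map
    return PLURALS.get(name, name)
```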
We can easily tabulate the scripts to see which characters speak the most. Let’s start by looking at those characters which speak the most over the series’ 60-year run.
Because we have a total of 3005 characters, we’ll need to cut things down immediately for the plot to be interesting. Let’s try both with and without the Doctor, as we expect this character to dominate. We’ll take the top 50 in each case.
Pretty interesting, but we know that characters appear in wildly different numbers of episodes - the Doctor appears in 334 episodes, “MAN” appears in 131, while Clara appears in only 39! Let’s see the most verbose speakers again after normalising by episode count. Let’s also calculate a few other basic stats - a lot of the characters with the most lines appear in only one episode, so we’ll limit some of our plots to repeat characters only.
We can also calculate which characters have the most words per episode, and the most words per line (And the longest monologues).
POLO wins the most lines per episode - what a screen hog! Clocking in at ~330 lines in his only episode, Marco Polo had the most individual lines of any character. The Doctor doesn’t quite make the cut, but after limiting to only repeat characters, we see that not only is the Doctor the most prolific character, appearing in 99% of episodes, he’s also the most verbose repeat character.
Words per line gives us some interesting results - SINGER and MUSIC, ANDREWMARR, TV, NARRATOR and NEWSMAN are all big winners here, giving us long, uninterrupted monologues.
The Doctors themselves, all 13 (well, 12) of them. Over the course of the series, many actors and writers have taken a stab at writing the Doctor. How has this changed over time? Are older Doctors more verbose? Do newer Doctors get more lines? Let’s see if we can find any clear trends.
And let’s take a quick peek at which doctors were the most popular with viewers:
DOC10 - David Tennant - takes the top spot, with none of the classic Doctors except DOC4 - Tom Baker - coming close.
DOC13 - Jodie Whittaker - is certainly not a fan favourite, with the average DOC13 episode having a score that would be considered low for most previous Doctors.
One way to analyse whether a character, or combination of characters, is popular is to use a series of plots examining the correlation between the number of lines a character has and that episode’s rating. Because of the vast number of characters we could graph here, we’ll select some of the Doctor’s main companions from the modern series to see who is the most popular.
Okay! Interesting! We actually do see that an increasing number of lines for Rose and Martha is associated with a lower episode rating. Of course, this is no longer significant after multiple-testing adjustment, but perhaps represents an interesting trend.
One thing we can easily generate from our data is a list of all characters and the number of times that they occur together within the same scene.
By counting the number of these interactions, we can construct a network using the R package igraph, which will attempt to draw a network connecting each character. For this analysis we’ll stick to mainstay repeat characters with a large number of lines, and we’ll perform the analysis split between the classic and modern eras, as we know that characters from the two do not interact (except in some very, very rare cases). Some separate characters do share names - particularly generic characters like “MAN”, but also some repeat characters such as “HARRY”. This is a largely unavoidable problem, and the only solution would be extensive watching of each episode and manual editing of the script files.
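The interaction-counting step itself is simple. A sketch, assuming each scene has already been reduced to the set of characters who speak in it:

```python
from collections import Counter
from itertools import combinations

def interaction_counts(scenes):
    # scenes: iterable of sets of character names appearing in the same scene.
    # Each unordered pair sharing a scene gains one interaction.
    counts = Counter()
    for chars in scenes:
        for a, b in combinations(sorted(chars), 2):
            counts[(a, b)] += 1
    return counts
```

The resulting pair counts can be fed straight into igraph (or Gephi) as a weighted edge list.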
We can view such character-character interactions in different ways - We can use clustering, or we can attempt to view an overall network. Let’s see how both look.
This looks great! We can see all of the doctors and their mainstay companions stick together very closely.
Let’s see if we can use the number of interactions as weights to drive a network graph of each series using igraph.
We’ll colour each node by its role - yellow for the Doctor, blue for companions, red for villains and green for miscellaneous characters that don’t quite fit any other category.
The network analysis gives us some pretty interesting conclusions: Doctors and their companions stick together strongly, and the two primary villains - the Master and the Daleks - are both very central to the series, connecting all of the Doctors with each other. We also see an increased presence of miscellaneous characters in the modern series, as well as signs that some companions connect individual Doctors (as they are carried over from one regeneration to another), such as Clara, Sarah and Rose.
If we take all of the series’ episodes together as one, we are a bit overwhelmed with characters. Perhaps we can achieve a more detailed graph utilising the external software package Gephi. Gephi allows us to quickly perform community analysis for our characters and re-colour our communities accordingly; it also provides a nice GUI for playing around with many graphical settings.
Here we will colour each community individually; the thickness of the connecting lines will be proportional to the total number of interactions between two characters, and the size of each character name will be proportional to their PageRank, a measure of node centrality.
Gephi Network (All Episodes)
This turned out to be a really effective way of visualising the over-arching character interactions of the series.
Firstly, we see the central role that repeat villains play throughout the series, with the DALEKs, the CYBERMEN, DAVROS and the MASTER being the only characters which connect all distant clusters, and being some of the most central characters by PageRank.
Our community analysis delineates each Doctor and their companions very well, with both Doctor 13 and Doctor 7 being clearly separated from all other characters. Interestingly, we see that Doctors 11 and 12, Doctors 2 and 6, and Doctors 9 and 10 belong to the same communities, as their companions carried over between “regenerations”.
A clear distinction between classic and modern eras is also visible, with the three modern era clusters separating to the left.
By tracking which characters appear together in each episode, we can construct an episode overlap timeline.
Looks good! We can see how long each companion lasts and which other characters they frequently overlap with. It’s also clear when previous Doctors pop back up for a quick re-appearance, and where some episodes contain many more characters than others. Neat.
One of the most common ways to get an overview of text-based data is to create word clouds, where the most frequent words in a script are represented graphically, with the size of each word corresponding to how frequently that character uses it.
Let’s try and see if we can create an R function that will generate a word cloud for a given character. We’ll be relying heavily on the package “tm” to achieve this.
wCloud <- function(character_name){
  characterLines <- allEps_for_docs %>%
    filter(Character == character_name)
  characterWords <- VCorpus(VectorSource(characterLines$Script))
  #Now, we'll want to clean this text in a few ways: lower-casing, removing numbers,
  #punctuation, excess white space and "stop" words (such as can, and, the)
  characterWords_c <- characterWords %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeNumbers) %>%
    tm_map(removePunctuation) %>%
    tm_map(stripWhitespace) %>%
    tm_map(removeWords, c(stopwords_en,stopwords("english"),"just","thats","dont","got","can","now","one"))
  #Now that we have clean text, the next step is to generate a term-document matrix
  character_term_mat <- as.matrix(TermDocumentMatrix(characterWords_c))
  words <- sort(rowSums(character_term_mat),decreasing=TRUE)
  character_df <- data.frame(word = names(words),freq=words)
  wordcloud(words = character_df$word, freq = character_df$freq, min.freq = 1,
            max.words=200, random.order=FALSE, rot.per=0.35,
            colors=brewer.pal(8, "Dark2"), scale = c(3,0.25))
}
Let’s take a look at how some of those turned out. Let’s check DOC9 and his companion, ROSE:
Great! But other than the reference to “ROSE”, it’s unlikely we’d really be able to tell these two word clouds apart from any other character.
If we want a deeper analysis of character-specific language, we might want to process our input text a little further, using “inverse document frequency” to find words that one character says often but that are not otherwise frequently said throughout the rest of the text.
One extra step we will perform is to reduce words to their individual word stems, for example: Dancing -> Dance and Houses -> House.
For this slightly heavier duty word processing, we will rely on another R package, “quanteda”.
We will also aim to remove character names. Removing character names from this list is a tough decision - really, names just give us information about who a character spends time with, but removing them means we might inadvertently drop terms like “Dalek”, which could be interesting.
This is achieved easily via quanteda’s dfm_tfidf() function.
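For intuition, the weighting being applied here can be sketched in plain Python (a simplified tf-idf with a natural-log idf; quanteda’s defaults differ slightly, and the docs structure here is hypothetical):

```python
import math
from collections import Counter

def tfidf(docs):
    # docs: dict mapping character -> list of word tokens (one "document" each).
    n = len(docs)
    # Document frequency: how many characters use each word at least once
    df = Counter()
    for words in docs.values():
        df.update(set(words))
    scores = {}
    for name, words in docs.items():
        tf = Counter(words)
        # Words used by every character get idf = log(1) = 0 and vanish
        scores[name] = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return scores
```

A word like “tardis” that everyone says scores zero, while a character’s pet phrase scores highly - exactly the behaviour we want for picking out character-specific vocabulary.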
Interesting! We see “haroon” pop up from DOC3’s attempts at speaking an alien language, and “shush” from DOC1’s attempts to keep his companions quiet. DOC5 references the TARDIS most often. Looking down the list, we see some classics - “spoiler” and “sweetie” for River. Rose, Martha and Donna all have the word “god” in their lists. Susan has “grandfather”, her common nickname for the Doctor, etc. Looks good!
One of the most common forms of analysis for large-scale text data is sentiment analysis. We’ll begin our sentiment analysis in Python and import the data below for visualization.
For this analysis in python, we’ll be using the NLTK package and the VADER sentiment analyzer.
This process involves stripping all punctuation from the text, separating the script into lists of strings, giving a POS (part-of-speech) tag such as “noun” or “verb” to each word in all script lines, applying a lemmatizer to the newly tagged words, and finally applying the VADER sentiment analyzer to each tagged word in each line.
Generally, sentiment analysis might be best applied to product reviews or tweets, which are less likely to have complex language than a TV script or a book. That may make this approach a little imperfect, but we need to do most of this data prep for downstream ML applications. In this case, sentiment is measured from -1 (Most negative) to +1 (Most positive).
Let’s start by looking at the ten most positive lines we can find in the text:
Wow! Very positive. I’m full of hope.
And the ten most negative lines:
Kill, kill, kill, pain, pain, pain. Evil robots malfunctioning - What could be more negative than that?
How about the sentiments of individual episodes? Individual characters? Scenes?
To make sure we don’t select edge cases/outliers, we’ll limit characters and scenes to those that have/contain more than just a few lines.
| | Episode | epSentiment | Character | charSentiment | Scene | sceneSentiment |
|---|---|---|---|---|---|---|
| 331 | The Romans | 0.1481757 | DOCTOR1 | 0.2066445 | S2E7:[Tardis] | 0.3886641 |
| 330 | The Celestial Toymaker | 0.1377142 | WILF | 0.1379338 | S31E13:[Amy’s bedroom] | 0.3843333 |
| 329 | “Vincent and the Doctor” | 0.1358523 | CHEN | 0.1279673 | S38E11:[Office] | 0.3840308 |
| 3 | The Brain of Morbius | -0.0326745 | RANI | -0.0217244 | S31E1:[Hospital corridor] | -0.2693364 |
| 2 | “Before the Flood” | -0.0421098 | DALEK | -0.0534310 | S20E1:[Amsterdam - Amstel Sluize] | -0.3051385 |
| 1 | “Heaven Sent” | -0.2012683 | BLACKDALEK | -0.0550911 | S9E4:[Varan’s village] | -0.3175000 |
Quite surprising that DOC1 has such a high mean sentiment! With him having so many lines, he really had to be consistently positive to make the cut. Looking through his speeches we see references to “good” and “love”, and he uses the word “yes” very frequently - all very positive sentiments. As for K9, this is perhaps a little surprising, but his references to “failure”, “danger” and “dead”, and his high frequency of “negative” and “no”, leave him with a very low mean sentiment. “Dalek” and “Black Dalek” being bottom of the list is no surprise!
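The per-character aggregation behind this table, including the minimum-line cutoff, can be sketched as (the threshold value is our own choice):

```python
from statistics import mean

def mean_sentiment(rows, min_lines=20):
    # rows: (character, line_sentiment) pairs.
    # Characters with fewer than min_lines lines are dropped to avoid
    # outliers driven by a handful of lines.
    by_char = {}
    for character, s in rows:
        by_char.setdefault(character, []).append(s)
    return {c: mean(v) for c, v in by_char.items() if len(v) >= min_lines}
```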
Sentiment, if it’s as accurate over the entire dataset as it is in the extremes above, might be a good predictor of episode rating or viewership. We can check this using a simple linear model.
We do have one sentiment outlier: the episode “Heaven Sent” has a negative sentiment five times greater in magnitude than any other, and it’s probably worth excluding from the linear model.
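Before fitting, a quick Pearson correlation check gives us the same information as the model’s slope significance. A self-contained sketch (plain Python rather than the R/statsmodels call we actually use):

```python
import math

def pearson_r(xs, ys):
    # Standard Pearson correlation coefficient: covariance over
    # the product of standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```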
Well, we see a clear correlation between viewership and rating, as expected, but no clear correlation between sentiment and rating.
What about character sentiment? Does the doctor’s sentiment correlate well with other characters from the same episode?
Well, there is no correlation between character sentiment and Doctor sentiment. While we’re here, we can also test whether the Doctor’s sentiment alone is correlated with viewership or rating, and we see that it is indeed very slightly correlated with rating (p=0.006), though this is largely driven by the two lowest-sentiment episodes, which also have very high ratings:
However, “The Magician’s Apprentice” has a surprisingly low viewer count, suppressing this effect in viewership. More than likely these correlations are totally spurious. In the next section, we will see if we can learn anything further from these features by attempting regression on a feature set which includes the above.
In this section, we will implement a naive Bayes machine learning algorithm as a classifier for a set of characters with a large number of lines. Given a script line as input, our classifier will return the character most likely to have said that line, or a similar line. We’d like to see both high accuracy in classifying real script lines the classifier has not seen before, and high subjective accuracy in classifying newly written lines.
Again, for convenience, most of this analysis will be carried out in Python, using scikit-learn for the implementation of the multinomial naive Bayes algorithm itself.
We will run our algorithm on TF-IDF adjusted frequencies, and only consider characters with a substantial number of words - over 5,000.
Okay, so 49% accuracy isn’t great, and is likely the result of our relatively poor class balance and lowish number of lines, in addition to the fact that many lines are quite short. Imbalance in our classes’ (the Doctors’) line counts leads to a very large bias in our weights, visible in our low precision. Still, with 7 categories it is much better than a random guess. Hyperparameter tuning was performed, but only generated a ~0.5% increase in accuracy.
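For intuition on what the scikit-learn model is doing under the hood, multinomial naive Bayes with Laplace smoothing can be written from scratch in a few lines (a sketch on raw token counts; the real pipeline uses TF-IDF features):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    # docs: list of token lists; labels: one class per doc.
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, y in zip(docs, labels):
        word_counts[y].update(words)
        vocab.update(words)
    n, v = len(labels), len(vocab)
    model = {}
    for y in class_counts:
        total = sum(word_counts[y].values())
        # Log prior, per-word log likelihoods, and a smoothed unknown-word score
        model[y] = (
            math.log(class_counts[y] / n),
            {w: math.log((word_counts[y][w] + alpha) / (total + alpha * v)) for w in vocab},
            math.log(alpha / (total + alpha * v)),
        )
    return model

def predict_nb(model, words):
    # Pick the class with the highest posterior log probability
    best, best_lp = None, -math.inf
    for y, (prior, likes, unk) in model.items():
        lp = prior + sum(likes.get(w, unk) for w in words)
        if lp > best_lp:
            best, best_lp = y, lp
    return best
```

The class imbalance problem is visible directly in the `prior` term: a class with many lines starts every prediction with a head start.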
Most of our bias is in Doctor 4 - see the 0.39 precision and 0.75 recall above. Our classifier is generating a large number of Doc4 false positives; perhaps if we remove him we will remove some of our bias, at the cost of losing this potential prediction.
A moderate 4% increase in accuracy, even after tuning for the best value of alpha, hardly seems worth it. We might attempt other models, such as a LinearSVC, a random forest or logistic regression. Let’s see what accuracy we can work our way up to after brief tuning on each model:
## Naive_Bayes(All) Naive_Bayes(no4) LinearSVC(All) LinearSVC(no4)
## 0.49 0.53 0.50 0.53
## RandomForest(All) RandomForest(no4) LogReg(All) LogReg(no4)
## 0.43 0.46 0.48 0.51
No improvement.
Attempting to use n-grams, where we would consider combinations of 2 or 3+ words, was largely ineffective in increasing accuracy of the model. Using n-grams between 1 and 3 resulted in a 52% accuracy, while (1,2) n-grams resulted in a 54% accuracy.
Not great! The best result was our reduced naive Bayes classifier using (1,2) n-grams at 54%. It is fairly likely that there aren’t enough differences in the Doctors’ individual speech patterns to generate a clear classification. Again, 54% is really not ideal. Furthermore, the above classification approaches, even using TF-IDF and n-grams, are ignorant of the context and meaning of words, so many phrases with potentially opposite meanings may be interpreted very similarly by our classifier. If we wanted to develop a more accurate classifier, the majority of the work would likely lie in reducing the total feature space using a manually curated dictionary and performing intensive optimisation of parameters.
While we have 12 Doctors (a large number of categories considering our data size), we only have 2 eras. Employing similar strategies to those above, we might be able to generate a more accurate classifier that can predict the era (classic or modern) a line is likely to have originated from. We’ll go through similar steps as above, testing logistic regression, naive Bayes, random forest and SVM approaches with hyperparameter tuning.
We must first subsample our data, as vectorising 240,000 lines is too much for my machine to handle. We will use 30,000 lines from each of the modern and classic eras as our data source.
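The subsampling step is straightforward; a sketch assuming the lines are already grouped by era (the seed is arbitrary, fixed for reproducibility):

```python
import random

def subsample(lines_by_era, per_era=30000, seed=42):
    # lines_by_era: dict era -> list of script lines.
    # Draw an equal-sized random sample from each era so the
    # classifier sees balanced classes.
    rng = random.Random(seed)
    return {
        era: rng.sample(pool, min(per_era, len(pool)))
        for era, pool in lines_by_era.items()
    }
```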
From our initial test, we identify that the most effective classifier was based on logistic regression (at 68% baseline accuracy) and the worst performer was a random forest classifier (at 62% baseline accuracy). Many attempts were made to push the accuracy above 80%, but unfortunately the maximum achieved was 72%. This was achieved using un-scaled, count-vectorized data, selecting the best 10,000 features from (1,1) n-grams.
When testing the model against the un-subsampled data, we again get 72% accuracy, with much higher precision for classic texts. Oh well! Code for the generation of this model, and all others, is available via GitHub.
Machine learning is frequently used to predict ratings or enjoyment of episodes/books/tweets/search results using previous ratings data. In this brief section we will train a regression model to predict an episode’s rating based on a feature space we construct from our dataset.
In this case, unlike our other machine learning feature spaces, we are not using character dialogue as our input and must design and select our own features. In order for our model to eventually converge on a reasonable predictive accuracy, picking features that do not contain too much noise while also explaining some of the variance in episode rating is likely the most important step.
Features that we will choose are as follows:
Two of these features, writer and director, are categorical. As such, we’ll need a way to encode them for input into the model. Another concern with these two features is that many writers/directors have a low episode count - just one or two. It is likely beneficial to group these together under an “Other” category.
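Both ideas - pooling rare categories and target encoding - can be sketched together (a simplified version; the threshold and the "Other" label are our own choices, and a production version would compute the encoding on training folds only to avoid leakage):

```python
from collections import defaultdict

def target_encode(categories, targets, min_count=3, other="Other"):
    # categories: e.g. writer names per episode; targets: episode ratings.
    counts = defaultdict(int)
    for c in categories:
        counts[c] += 1
    # Pool rare writers/directors into a single "Other" category
    pooled = [c if counts[c] >= min_count else other for c in categories]
    # Replace each category with the mean rating of its episodes
    sums, ns = defaultdict(float), defaultdict(int)
    for c, t in zip(pooled, targets):
        sums[c] += t
        ns[c] += 1
    means = {c: sums[c] / ns[c] for c in sums}
    return [means[c] for c in pooled]
```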
We’ll attempt to create our regression model using the python sklearn toolkit, and we’ll test out both the Lasso and ElasticNet models.
We have used target encoding for our nominal categories, and after hyperparameter tuning of alpha using K-fold cross-validation, we achieve a cross-validated MAE of roughly 0.608 (reported by sklearn as a negative score, -0.608). Applying the model to test data it has not seen before, we achieve an MAE of 0.806, with an r2 of 0.25 - not incredible, but probably enough to differentiate the worst episodes from the 9/10s. Let’s examine some of the coefficients of our Lasso model:
## Director Writer Runtime Characters Locations Scenes
## 0.687590 0.748027 0.046891 0.034326 0.047393 0.000000
## Lines WordsPerLine
## 0.048360 -0.146544
Perhaps unsurprisingly, writer and director dominate. This is essentially what we might have expected given our previous exploratory analysis and the clear distinction in mean rating for each writer (and each director). The small negative coefficient for words per line is somewhat interesting. Clearly, however, the driving factors and best predictors of rating are the director and writer, weighted by the success of their previous episodes.
Text prediction here means the continuous generation of next-word predictions from the trailing words of a script line. We’ll start by training our model on a single character. For this use of an LSTM, a character with a very recognisable speech pattern is useful (and fun), so our first target will be one of Doctor Who’s central villains: the Dalek.
Using the Dalek has one major advantage - We are very likely to be able to recognise if the neural net is producing lines that sound “Daleky”, but this comes at the major cost of potential repetition. Daleks always say “exterminated” after “you will be”, and often repeat this phrase in a single line. This is likely to lead our generation algorithm to get stuck in a loop if it is trained poorly.
Our neural network is designed as follows:
On our first epoch, the categorical cross-entropy loss function reported a loss of 6.4, reducing to ~1 by the end of training. After model generation, we provide 5 seed phrases and use the model to generate words until each line reaches a length of 10 words.
When running the model for 100 epochs, we note that our loss function is still reducing with successive runs, which means that further training may be a possible strategy for improving generation.
Hm, those sound… okay. “You are a traitor to the daleks you must be exterminated” is a direct quote from S9E1, and “I am a soldier…” is also roughly similar to a direct line from S27E6. It seems we have a bit of overfitting happening and, as above, our loss function is still decreasing, so we could attempt more epochs. As such, we are likely to see some improvement if we tune some of our hyperparameters - specifically the number of epochs (let’s take 200) and the number of sequential nodes in our LSTM (we’ll also go for 200). Here we can see the change in our loss function over time, and the texts produced from our seed phrases.
Some of the phrases here are identical, so it’s probable we cannot generalise past them without a massive increase in training data. We also see a slightly peculiar loss trajectory at roughly 125 epochs. This is almost certainly due to some kind of large batch variation. Our batch size was 32 (the default), and increasing it would most likely help solve this problem.